We present a novel method for hierarchical topic detection where topics are obtained by clustering documents in multiple ways. Specifically, we model document collections using a class of graphical models called hierarchical latent tree models (HLTMs). The variables at the bottom level of an HLTM are observed binary variables that represent the presence/absence of words in a document. The variables at other levels are binary latent variables, with those at the lowest latent level representing word co-occurrence patterns and those at higher levels representing co-occurrence of patterns at the level below. Each latent variable gives a soft partition of the documents, and document clusters in the partitions are interpreted as topics. Latent variables at high levels of the hierarchy capture long-range word co-occurrence patterns and hence give thematically more general topics, while those at low levels of the hierarchy capture short-range word co-occurrence patterns and give thematically more specific topics. Unlike LDA-based topic models, HLTMs do not refer to a document generation process and use word variables instead of token variables. They use a tree structure to model the relationships between topics and words, which is conducive to the discovery of meaningful topics and topic hierarchies.
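
To make the two central ideas concrete (binary word variables and the soft partition given by a latent variable), the following is a minimal sketch, not the authors' implementation. It assumes a single binary latent variable Z whose word children are conditionally independent given Z, which is only the simplest building block of a latent tree. The vocabulary, the prior, and the conditional probabilities are hypothetical values chosen for illustration; in an actual HLTM both the structure and the parameters would be learned from data.

```python
import math

def binarize(doc_tokens, vocabulary):
    """Map a tokenized document to presence/absence indicators (word variables,
    not token variables: repeated occurrences of a word are ignored)."""
    present = set(doc_tokens)
    return [1 if w in present else 0 for w in vocabulary]

def posterior_z1(x, prior_z1, p_word_given_z):
    """Return P(Z=1 | x) for one binary latent Z with conditionally independent
    binary word children.

    p_word_given_z[j] = (P(word_j=1 | Z=0), P(word_j=1 | Z=1)).
    """
    log_joint = [math.log(1.0 - prior_z1), math.log(prior_z1)]
    for xj, probs in zip(x, p_word_given_z):
        for z in (0, 1):
            p = probs[z] if xj == 1 else 1.0 - probs[z]
            log_joint[z] += math.log(p)
    # Normalize in log space for numerical stability.
    m = max(log_joint)
    w0, w1 = (math.exp(v - m) for v in log_joint)
    return w1 / (w0 + w1)

# Hypothetical vocabulary and parameters (assumptions, not from the paper).
vocabulary = ["nasa", "orbit", "launch"]
prior_z1 = 0.2
p_word_given_z = [(0.02, 0.6), (0.01, 0.5), (0.05, 0.7)]

docs = [["nasa", "launch", "nasa", "delay"], ["market", "stocks"]]
vectors = [binarize(d, vocabulary) for d in docs]

# Each document receives a degree of membership in the cluster Z=1,
# i.e. a soft partition; the cluster Z=1 would be read as a topic.
memberships = [posterior_z1(x, prior_z1, p_word_given_z) for x in vectors]
```

Under these assumed parameters, the first document has a high membership in the cluster Z=1 and the second a low one. In a full hierarchy, each latent variable at each level would induce its own partition of this kind.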